
    Strategies for Searching Video Content with Text Queries or Video Examples

    The large number of user-generated videos uploaded onto the Internet every day has led to many commercial video search engines, which rely mainly on text metadata for search. However, metadata is often lacking for user-generated videos, making them unsearchable by current search engines. Content-based video retrieval (CBVR) tackles this metadata-scarcity problem by directly analyzing the visual and audio streams of each video. CBVR encompasses multiple research topics, including low-level feature design, feature fusion, semantic detector training, and video search/reranking. We present novel strategies in these topics to enhance CBVR in both accuracy and speed under different query inputs, including pure textual queries and queries by video example. Our proposed strategies were incorporated into our submission for the TRECVID 2014 Multimedia Event Detection evaluation, where our system outperformed other submissions on both text queries and video example queries, demonstrating the effectiveness of our proposed approaches.
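
    The abstract does not specify the fusion scheme; one common approach in CBVR systems of this kind is weighted late fusion of per-feature detector scores. Below is a minimal sketch under that assumption; the function name, feature names, and weights are all hypothetical, not taken from the paper.

        import numpy as np

        def late_fusion(scores_by_feature: dict[str, np.ndarray],
                        weights: dict[str, float]) -> np.ndarray:
            """Combine per-feature detector scores for a set of candidate videos.

            scores_by_feature maps a feature name (e.g. "sift", "mfcc") to an
            array of raw detector scores, one per video. Scores are z-normalized
            per feature so that differently scaled detectors are comparable,
            then combined as a weighted sum.
            """
            fused = None
            for name, scores in scores_by_feature.items():
                z = (scores - scores.mean()) / (scores.std() + 1e-8)
                contrib = weights.get(name, 1.0) * z
                fused = contrib if fused is None else fused + contrib
            return fused

        # Hypothetical example: fuse visual and audio detector scores for 4 videos.
        scores = {
            "sift": np.array([0.9, 0.2, 0.4, 0.7]),
            "mfcc": np.array([10.0, 3.0, 8.0, 1.0]),
        }
        ranking = np.argsort(-late_fusion(scores, {"sift": 0.7, "mfcc": 0.3}))
        print(ranking)  # video indices ordered from most to least relevant

    The per-feature z-normalization matters because raw detector scores live on different scales; without it, the feature with the largest numeric range would dominate the fused ranking regardless of the weights.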

    Multiface: A Dataset for Neural Face Rendering

    Photorealistic avatars of human faces have come a long way in recent years, yet research in this area is limited by a lack of publicly available, high-quality datasets covering both dense multi-view camera captures and rich facial expressions of the captured subjects. In this work, we present Multiface, a new multi-view, high-resolution human face dataset collected from 13 identities at Reality Labs Research for neural face rendering. We introduce Mugsy, a large-scale multi-camera apparatus for capturing high-resolution synchronized videos of a facial performance. The goal of Multiface is to close the gap in accessibility to high-quality data in the academic community and to enable research in VR telepresence. Along with the release of the dataset, we conduct ablation studies on how different model architectures affect the model's capacity to interpolate novel viewpoints and expressions. With a conditional VAE model serving as our baseline, we find that adding spatial bias, a texture warp field, and residual connections improves performance on novel view synthesis. Our code and data are available at: https://github.com/facebookresearch/multiface
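
    The abstract names two of the ablated components, spatial bias and residual connections, without detailing them. The sketch below illustrates what those additions look like in a small convolutional decoder; all layer sizes and names are illustrative assumptions, not the authors' architecture, and the texture warp field is omitted for brevity.

        import torch
        import torch.nn as nn

        class ResBlock(nn.Module):
            """Residual block: output = x + conv(relu(conv(x)))."""
            def __init__(self, ch: int):
                super().__init__()
                self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
                self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

            def forward(self, x):
                return x + self.conv2(torch.relu(self.conv1(x)))

        class Decoder(nn.Module):
            """Decodes a latent code, conditioned on a view vector, into an image.

            The learned `spatial_bias` is a per-location parameter added to the
            feature map, letting the network break the translation invariance
            of convolutions (useful when the face is roughly aligned).
            """
            def __init__(self, z_dim=256, view_dim=3, ch=64, size=32):
                super().__init__()
                self.fc = nn.Linear(z_dim + view_dim, ch * size * size)
                self.spatial_bias = nn.Parameter(torch.zeros(1, ch, size, size))
                self.blocks = nn.Sequential(ResBlock(ch), ResBlock(ch))
                self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)
                self.ch, self.size = ch, size

            def forward(self, z, view):
                h = self.fc(torch.cat([z, view], dim=-1))
                h = h.view(-1, self.ch, self.size, self.size)
                h = h + self.spatial_bias          # learned spatial bias
                h = self.blocks(h)                 # residual refinement
                return torch.sigmoid(self.to_rgb(h))

        # Hypothetical usage: one latent code rendered from one viewpoint.
        dec = Decoder()
        img = dec(torch.randn(1, 256), torch.randn(1, 3))
        print(img.shape)  # torch.Size([1, 3, 32, 32])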

    Surveillance Video Analysis with External Knowledge and Internal Constraints

    The automated analysis of video data becomes ever more important as we are inundated with the ocean of videos generated every day, leading to much research in tasks such as content-based video retrieval, pose estimation, and surveillance video analysis. Current state-of-the-art algorithms for these tasks are mainly supervised, i.e., they learn models from manually labeled training data. However, it is difficult to manually collect large quantities of high-quality labeled data. In this thesis, we therefore propose to circumvent this problem by automatically harvesting and exploiting useful information from unlabeled video based on 1) out-of-domain external knowledge sources and 2) internal constraints in video. Two tasks in the surveillance domain were targeted: multi-object tracking and pose estimation.

    Being able to localize and identify each individual at each time instant would be extremely useful in surveillance video analysis. We tackled this challenge by formulating it as an identity-aware multi-object tracking problem. An existing out-of-domain knowledge source, face recognition, and an internal constraint, spatial-temporal smoothness, were combined in a joint optimization framework to localize each person. The spatial-temporal smoothness constraint was further utilized to automatically collect large amounts of multi-view person re-identification training data, which was used to train deep person re-identification networks that further enhanced tracking performance on our 23-day, 15-camera data set consisting of 4,935 hours of video. Results show that our tracker can locate a person 57% of the time with 73% precision.

    Reliable pose estimation in video enables us to understand a person's actions, which is very useful in surveillance video analysis. However, domain differences between surveillance videos and a pose detector's training set often degrade pose estimation performance. We therefore proposed an unsupervised domain adaptation method based on constrained self-training. By combining an out-of-domain image-based pose detector (external knowledge) with spatial-temporal smoothness constraints (internal constraints), our method automatically collects in-domain pose estimation training data from video for domain adaptation. Results show that a pose detector trained on in-domain data collected with our unsupervised approach is significantly more effective than models trained on more out-of-domain data.

    Finally, building on our improved multi-object tracker and pose detector, long-term analyses of nursing home resident behavior were performed. The tracker's output was accurate enough to generate for each resident a reasonable "visual diary", which not only shows the activities performed throughout the day but also accumulates long-term statistics that are simply too tedious to compute manually. Pose detectors were also utilized to detect residents' eating behavior, which has the potential to aid the assessment of their health status.

    In conclusion, our results demonstrate the effectiveness of utilizing external knowledge and internal constraints to enhance multi-object tracking and pose estimation. All of the proposed methods automatically harvest useful information directly from unlabeled video. Based on the promising experimental results, we believe the lessons learned could generalize to other video analysis problems that would similarly benefit from external knowledge or internal constraints used in an unsupervised manner, reducing the need to manually label data. Furthermore, our proposed methods potentially open the door to automated analysis of the ocean of surveillance video generated every day.
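
    The abstract does not give the exact selection rule behind the constrained self-training. The sketch below shows the general idea of using temporal smoothness to filter an image-based detector's per-frame outputs into pseudo-labels for retraining; the thresholds, function name, and selection rule are assumptions for illustration, not the thesis's actual method.

        import numpy as np

        def select_pseudo_labels(poses: np.ndarray,
                                 confidences: np.ndarray,
                                 max_joint_motion: float = 10.0,
                                 min_confidence: float = 0.8) -> list[int]:
            """Pick frames whose detections are trustworthy enough to retrain on.

            poses:        (T, J, 2) per-frame joint coordinates from an
                          out-of-domain image-based pose detector.
            confidences:  (T,) detector confidence per frame.

            A frame is kept only if the detector is confident AND the pose moves
            smoothly from the previous frame (spatial-temporal smoothness):
            large frame-to-frame joint jumps usually indicate detection failures.
            """
            keep = []
            for t in range(1, len(poses)):
                motion = np.linalg.norm(poses[t] - poses[t - 1], axis=-1).max()
                if confidences[t] >= min_confidence and motion <= max_joint_motion:
                    keep.append(t)
            return keep

        # Hypothetical example: 4 frames, 2 joints; frame 2 jumps implausibly.
        poses = np.array([[[0, 0], [5, 5]],
                          [[1, 0], [5, 6]],
                          [[40, 40], [50, 50]],   # detection failure
                          [[2, 1], [6, 6]]], dtype=float)
        conf = np.array([0.9, 0.95, 0.99, 0.9])
        print(select_pseudo_labels(poses, conf))  # [1]: only the smooth, confident frame

    The same smoothness idea would also apply to harvesting re-identification pairs in the tracking work: detections of one person linked by smooth trajectories across time and views can be treated as matching training pairs without manual labels.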